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Abstract 

Gene expression data matrices often contain missing expression values. In this pa- 
per, we describe a new algorithm, named improved fixed rank approximation algorithm 
(IFRAA), for missing values estimations of the large gene expression data matrices. 
We compare the present algorithm with the two existing and widely used methods for 
reconstructing missing entries for DNA microarray gene expression data: the Bayesian 
principal component analysis (BPCA) and the local least squares imputation method 
(LLS). The three algorithms were applied to four microarray data sets and two syn- 
thetic low-rank data matrices. Certain percentages of the elements of these data sets 
were randomly deleted, and the three algorithms were used to recover them. In con- 
clusion IFRAA appears to be the most reliable and accurate approach for recovering 
missing DNA microarray gene expression data, or any other noisy data matrices that 
are effectively low rank. 

Index Terms-Gene expression matrix, singular value decomposition, principal 
component analysis, least squares, missing values imputation, Bayesian analysis, 
K-nearest neighbor. 

1 Introduction 

DNA microarrays are used as a tool for analyzing information in gene expression data over a 
broad range of biological applications such as cancer classification [9], cancer prognosis [13] 
and identifications of cell cycle- regulated genes of yeast [14]. During the laboratory process, 
some spots on the array may be missing due to various factors (for example, machine error.) 
Because it is often very costly and time consuming to repeat the experiment, molecular 
biologists, statisticians, and computer scientists have made attempts to recover the missing 
gene expressions by some ad-hoc or systematic methods. 

Microarray gene expression data is often represented as a gene expression matrix G = 
{gij)^'JZi with n rows, which correspond to genes, and m columns, which correspond to 
experiments. Thus gij is the expression of the gene i in the j — th experiment. Typically n 
is much larger than m. In this setting, the analysis of missing gene expressions on the array 
would translate to recovering missing entries in the gene expression matrix values. 

In the last six years there have been at least six published papers in the literature 
discussing the problems of missing gene expression data and algorithms to recover them: the 
Bayesian principal component analysis (BPCA) [12] ; the fixed rank approximation algorithm 
(FRAA) [5]; the weighted K-nearest neighbors (KNNimpute) [15]; the least squares principal 
(LSP) [2]; the local least squares imputation method (LLS) [10]; the projection onto convex 
sets methods (POCS) [7]. 
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The purpose of this paper is to introduce the improved fixed rank approximation algo- 
rithm (IFRAA). We compare IFRAA with BPCA and LLS, since the software programs 
for implementing these methods are easily available. We have omitted comparison with 
KNNimpute, since the simulations of [12] and [10] show that BPCA and LLS are superior 
to KNNimpute. 

KNNimpute and LLS are local methods, which use similarity structure of the data 
to impute the missing values. KNNimpute uses the weighted averages of the nearest 
uncorrupted neighbors. LLS has two versions to find similar genes whose expressions are not 
corrupted: the L2-norm and the Pearson's correlation coefficients. After a group of similar 
genes C are identified, the missing values of the gene are obtained using least squares applied 
to the group C. In these two methods, the recovery of missing data is done independently, 
i.e. the estimation of each missing entry does not influence the estimation of the other 
missing entries. 

BPCA is a global method consisting of three components. First, principal component 
regression, which is basically a low rank approximation of the data set is performed. Second, 
Bayesian estimation, which assumes that the residual error and the projection of each gene 
on principal components behave as normal independent random variables with unknown 
parameters, is carried out. Third, Bayesian estimation follows by iterations based on the 
expectation-maximization (EM) of the unknown Bayesian parameters. 

IFRAA is a combination of FRAA, developed in [5], and a good clustering algorithm. 
One first applies FRAA, whose description is below, to complete the missing data. Then 
one applies a clustering algorithm to group the data to a small number of clusters of data 
with similar characteristics. In each cluster FRAA is applied again to update the estimated 
values of missing entries in the cluster. 

FRAA is a global method which finds the values of the missing entries of the gene 
expression matrix G, such that the obtained G minimizes the objective function fi{X). 
Here /; {X) is the sum of the squares of all but the first I singular values of an n x m matrix 
X . The minimum of fi{X) is considered on the set X, which is the set of all possible choices 
of matrices X = (a;^)"^'^!, such that Xij = gij if the entry is known. The completion 
matrix G is computed iteratively, by a local minimization of fi{X) on X. 

The estimation of missing entries in FRAA is done simultaneously, i.e., the estimation 
of one missing entry influences the estimation of the other missing entries. 

2 Mathematical descriptions of FRAA and IFRAA 

Let G be the n x m gene expression matrix, where n> m. Assume first that G does not have 
missing entries. Recall the singular value decomposition of G := U'EV'^, called SVD, [8]. Let 
f 1 > cr2 > . . . > (7m > be the m singular values of G, which are the nonnegative roots of the 
eigenvalues of G^G. Let Ui, . . . , € M" and Vi, . . . , G be the column orthonormal 
eigenvectors of GG^ and G^G corresponding to the eigenvalues ai,...,(T^ respectively. 
Ui, . . . , u„i and vi, . . . , Vm are called the left and the right orthonormal singular column 
vectors of G. Then U is n x m matrix, with the columns Ui, . . . Um, V is an m x m matrix, 
with columns vi, . . . , v^, and S is the diagonal mxm matrix, with cti, . . . , CTm on the main 
diagonal. Thus G = X^I^i ^i'^i'^J ■ Ui, . . . , and vi, . . . , are the principal directions 
of the matrices GG^ and G^G respectively. The rank r of G is equal to the number of 
positive singular values of G. For each 1 < Z < r, the matrix Gi := fJiUjV^ is the best 
n X m approximation matrix of rank I. That is if A is any n x m matrix of rank I at most, 
than ||G — > ||G — G;||^. (||G||:f is the Euclidean norm of G viewed as a vector with 
nm coordinates.) An integer I e [l,r] is called the effective rank of G, if / is the smallest 
integer for which is much smaller than 1. Then G; is called the filtered G, and Gi can 
be viewed as the noise reduction of G. 

In microarray analysis of the gene expression matrix G, the vectors Ui , . . . , are called 
eigengenes, the vectors vi , . . . , are called eigenarrays and cti , . . . , am are called eigen- 
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expressions. The effective rank I of G can be viewed as the nnmber of different biological 
functions of n genes observed in m experiments. The eigenarrays vi, . . . , v; give the prin- 
cipal I orthogonal directions in corresponding to a\,. . . ,ai. The eigengenes Ui , . . . , 
give the principal I orthogonal directions in R" corresponding to (Ti , . . . , cr; . The cigen ex- 
pressions describe the relative significance of each bio-function. From the data given in [1], 
one concludes that the number of significant singular values never exceeds The essence 
of the FRAA algorithm is based on this observation. 

Computationally, one brings G to an upper bidiagonal matrix A using Householder ma- 
trices. Then one applies implicitly the QR algorithm to A^A to find the positive eigenvalues 
af, ...,(7^ and the corresponding orthonormal eigenvectors vi, ...,Vr of the matrix G'^G [8]. 

Assume now that G is the gene expression matrix with missing data. We can estimate the 
effective rank of G by computing the effective rank of the submatrix hxm, corresponding to 
all genes with uncorrupted entries [5, §2 ]. Let I be our estimate for the effective rank of the 
completed gene expression matrix. Denote by X the set of all n x m matrices whose entries 
coincide with the uncorrupted entries of G. Thus X is the set of all possible completion 
of the corrupted gene matrix G. FRAA completes the missing values of G by finding the 
minimum to the following optimization problem: 

m m 

min ^ii^f = ^liere G* G X. (2.1) 

i=l+l i=l+l 

Ideally, G* is the completion of the gene matrix expression with missing values. In practice, 
FRAA uses the following iterative procedure: 

Fixed Rank Approximation Algorithm: Let Gp G X be the p*^ approximation to 
a solution of optimization problem (2.1). Let Ap := GpGp and find an orthonormal set of 
eigenvectors for Ap, Vp i, Vp,„i. Then Gp+i is a solution to the following minimum of a 
convex nonnegative quadratic function miiixex 
iXvp^,)^{Xvp^,). 

The flow chart of this algorithm can be given as: 



Fixed Rank Approximation Algorithm (FRAA) 

Input: integers m,n, L,iter, the locations of non-missing entries <S, initial approxi- 
mation Go of n X m matrix G. 
Output: an approximation Guer of G. 
for p = to iter — 1 

- Compute Ap := GjGp and find an orthonormal set of eigenvectors for Ap, 

- Gp+i is a solution to the minimum problem (2.1) with L = I. 

In each step of the algorithm we decrease the value of fi{X): fi(Gp) > /((Gp+i). Hence 
the sequence Gp,p = 1,... converges to a critical point G. Thus FRAA gives a good 
approximation of G. In many simulations we had we confirmed that G — G* . 

Consider the following inverse eigenvalue problem (lEP): Find the values of the missing 
entries of G such that the nonnegative definite matrix G^G will have m — I smallest eigen- 
values equal to zero. lEP appear often in engineering. See [6] for examples of lEP and a 
number of good algorithms to solve these problems. In fact, FRAA is based on one of the 
algorithms for the inverse eigenvalue problems discussed in [6]. 

As pointed out in [5] FRAA is a robust algorithm which performs good, but not as well 
as KNNimpute. The reason of the superiority of KNNimpute lies in fact that it reconstruct 
the missing values of each gene from similar genes. IFRAA discussed here overcomes this 
disadvantage. 

IFRAA works as follows. First we use FRAA to find a completion G. Then we use a 
cluster algorithm, (we used K- means by repeating and refining the cluster size), to find a 
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reasonable number of clusters of similar genes. Presumably each cluster is a relatively smaller 
matrix having an effective low rank. For each cluster of genes we apply FRAA separately 
to recover the missing entries in this cluster. It turns out that this modification results in a 
very efficient algorithm for reconstructing the missing values of the gene expression matrix. 

We also note that IFRAA performs best in reconstructing missing values of n x m 
matrices, which have low effective ranks. These results suggest that IFRAA has a potential 
for being an effective algorithm to recover blurred spots in digital images. 

3 Results 

For comparison of different imputation algorithms, six different types of data sets were used, 
consisting of four microarray gene expression data and two randomly generated synthetic 
data. Two data sets of microarray were obtained from studies for the identification of cell- 
cycle regulated genes in yeast (Saccharomyces cerevisiae) [14]. The first gene expression 
data set is a complete matrix of 5986 genes and 14 experiments based on the Elutriation 
data set in [14]. The second microarray data set is based on Cdcl5 data set in [14], which 
contains 5611 genes and 24 experiments. Two other yeast data sets obtained from "http:// 
sgdlite.princeton.edudownloadyeast-datasets" . The Evolution data set has been studied in 
[4] and Calcineurin data set has been studied in [16]. Two synthetic data set was randomly 
generated matrices of size 2000 x 20 and ranks 2 and 8 respectively. 

To assess the performance of missing value estimation methods, we performed the follow- 
ing simulations. On the first two microarray data sets and on the synthetic data we deleted 
randomly 1%, 5%, 10%, 15% and 20% of the entries from the complete matrix C. Then 
we estimated the various completions of the missing values by BPCA, IFRAA and LLS. 
We set the K-value parameter (number of similar genes) such that there was no increase in 
performance of the LLS by increasing k. 

We used a normalized root mean square error (NRMSE) as a metric for comparison. If 
C represents the complete matrix and C represents the completed matrix using an estimate 
to the corrupted entries in C, then the root mean square error (RMSE) is where 

D = C — C. We normalized the root mean square error by dividing RMSE by the average 
value of the entries in C. 

In IFRAA the parameter L, which is the number of significant singular values plus 1, 
was chosen by comparison of ratio of two consequent singular values. We observed that 
this parameter appeared to be equal to 2 or 3 depending on data set and may differ for 
each small block of data (cluster). The initial guess for the missing entries in each gene was 
chosen to be the row average of its corresponding row. 

Figure 1 depicts the comparison of BPCA, IFRAA and LLS for Elutriation data set in 
[14] . We break the whole gene expression matrix by clustering the data into groups of genes, 
which form matrices with effective low ranks. We applied FRAA on each group. The graph 
is the average over 25 runs, and as can be seen for this data set IFRAA performed the best, 
BPCA and LLS have very close performance with significant gap with IFRAA. 

Figure 2 depicts the comparison of BPCA. and LLS for Cdcl5 data set in [14] which 
contains 5611 genes and 24 experiments. In this case IFRAA again performed the best and 
LLSimpute performed slightly better than BPCA. 

The performance of the BCPA, IFRAA and LLS algorithms depends on the unknown 
distribution of missing position of the entries. To study this issue we applied all methods on 
the original data sets containing missing values. Since NRMS error could not be calculated 
for these actual missing values, we transferred the missing value positions from the original 
data to corresponding positions in the complete data derived from the original data set 
before applying the algorithm. By doing this the distribution of missing value positions in 
complete data set is almost unchanged from the actual distribution. The result is illustrated 
in Table I for three data sets including the original data set of Cdcl5 which contains %0.7 
missing values (%0.81 missing in complete data), Evolution data set [4] which contains 
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Table 1: Comparison of NRMSE for three methods: IFRAA, LLS and BPCA for actual 
missing values distribution for three gene expression data sets with different percentages of 
missing values. 



Data sets 


IFRAA 


LLS 


BPCA 


Cdcl5 data set %0.81 missing 


0.0175 


0.0200 


0.0216 


Evolution data set %9.16 


0.0703 


0.0969 


0.1247 


Calcineurin data set %3.68 


0.0421 


0.0445 


0.0453 



%8.457 missing values (%9.1 missing in complete data) and Calcineurin data set [16] which 
contains %3.2 missing values (%3.68 missing in complete data). This result again confirms 
the superiority of the IFRAA for the actual microarray data missing value estimation. 
The random matrices of order 2000 x 20 and of ranks fc = 2,8 appearing in Figures 

3 and 4 were generated as follows. One generates 2k random column vectors xi, . . . ,Xfe G 
]^2000^ yi , . . . , Yfe G M^" , where the entries of these vectors are chosen according to an uniform 
distribution. Then C = Yl'i=i ^iVj ■ 

Figure 3 represents the comparisons of BPCA, IFRAA and LLS for 2000 x 20 random 
matrix of rank 2. The performance of the three algorithms is excellent for 1% of missing data. 
The performance of LLS constantly deteriorates with the increase percentage of missing data. 
The performance of BPCA deteriorated with the increase percentage of missing data, but 
less than LLS. IFRAA performed outstandingly. 

Figure 4 represents the comparisons of BPCA, IFRAA and LLS for 2000 x 20 random 
matrix of rank 8. The performance of LLS is the same as in Figure 3. BPCA and IFRAA 
performed extremely well. IFRAA slightly outperformed BPCA in particular in the case 
with 20% of missing data. 

4 Conclusions 

This paper describes the improved fixed rank approximation algorithm (IFRAA), a local- 
global algorithm which exploits the local similarity in data. We compared IFRAA to the 
Bayesian principal component analysis (BPCA) and the local least squares imputation 
method (LLS). We applied the three algorithms to several data sets. We corrupted, at 
random, certain percentages of these data sets and let the three algorithms BPCA, IFRAA 
and LLS recover them. We also applied the three algorithms on real gene expression data 
sets while keeping the distribution of missing values unchanged. 

We found that IFRAA performed better than BPCA and LLS for actual microarray miss- 
ing value estimation. In addition we observed that for microarray data sets LLS performed 
slightly better than BPCA. 

We also applied three algorithms on synthetic data sets, which were random 2000 x 20 
matrices of ranks 2 and 8. We again corrupted at random certain percentages of these data 
sets. IFRAA and BPCA were able to recover the data quite well, where IFRAA slightly 
outperformed BPCA, in particular in the case with of higher percentage of missing data. The 
performance of LLS deteriorated gradually with increasing percentage of missing entries. 

In conclusion IFRAA appears to be the most reliable method for recovering missing 
values in DNA microarray gene expression data. IFRAA was also the best to recover missing 
values in synthetic data, corresponding to a data matrix with an effectively low-rank. These 
results suggest that IFRAA has a potential for being an effective algorithm to recover blurred 
spots in digital images. 
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0.01 ' ' ' ' ' ' ' ' ' ' 1 

2 4 6 8 10 12 14 16 18 20 

% tnissifig value 

Figure 1: Comparison of NRMSE against percent of missing entries for three methods: 
IFRAA, BPCA and LLS. Elutriation data set in [14] with 14 samples. 



Cdc15dataset 



- IFRAA 

- LLS 




Figure 2: Comparison of NRMSE against percent of missing entries for three methods: 
IFRAA, BPCA and LLS. Cdcl5 data set in [14] with 24 samples. 



6 



Data matrix of rank 2 

0.045 I , , , , , r 




% Of missing 

Figure 3: Comparison of NRMSE against percent of missing entries for three methods: 
IFRAA, BPCA and LLS. Data set was a 2000 x 20 randomly generated matrix of rank 2. 



Data matrix of ranit 8 




Figure 4: Comparison of NRMSE against percent of missing entries for three methods: 
IFRAA, BPCA and LLS. Data set was a 2000 x 20 randomly generated matrix of rank 8. 
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